Solutions for 05. Simple linear regression: model evaluation

Notes and in-class exercises

# Load packages and import data
library(readr)
library(ggplot2)
library(dplyr)

bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")

Exercise 1: Is the model correct?

The blue curved trend line shows a clear downward trend around 85 degrees, which contextually makes plenty of sense—extremely hot days would naturally see less riders. Overall the combination of the upward trend and downward trend makes for a curved relationship that is not captured well by a straight line of best fit.

ggplot(bikes, aes(x = temp_feel, y = riders_registered)) + 
    geom_point() + 
    geom_smooth(se = FALSE) +
    geom_smooth(method = "lm", se = FALSE, color = "red")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Exercise 2: Fixing the model

The second plot (showing the model with squared temperature) follows the natural curve in the trend better.

bikes <- bikes %>% 
    mutate(temp_feel_squared = temp_feel^2)

# Plot the model WITHOUT squared temperature
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) + 
    geom_point() + 
    geom_smooth(se = FALSE) +
    geom_smooth(method = "lm", se = FALSE, color = "red")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

# Plot the model WITH squared temperature
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) + 
    geom_point() + 
    geom_smooth(se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE, color = "red")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Exercise 3: Residual plots

The first residual plot (from the model with just a straight line trend) shows a lingering trend in the residuals—the blue curve traces the trend in the residuals, and it does not lie flat on the y = 0 line.

On the other hand, the second residual plot (from the model which uses a squared term to allow for curvature) shows very little trend in the residuals—the blue curve is almost flat on the y = 0 line.

# Fit a linear model
bike_mod1 <- lm(riders_registered ~ temp_feel, data = bikes)
# Fit a quadratic model
bike_mod2 <- lm(riders_registered ~ temp_feel + temp_feel_squared, data = bikes)

# Check out the residual plot for bike_mod1 (the incorrect model)
ggplot(bike_mod1, aes(x = .fitted, y = .resid)) + 
    geom_point() + 
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Construct the residual plot for bike_mod2 (the good model)
ggplot(bike_mod2, aes(x = .fitted, y = .resid)) + 
    geom_point() + 
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Exercise 4: Another example of an incorrect model

# Import the data
library(fivethirtyeight)
data(bechdel)

# Get only 1997 movies
movies_1997 <- bechdel %>% 
    filter(year == 1997)

# Construct the model
bechdel_model <- lm(intgross ~ budget, movies_1997)
# Scatterplot of earnings and budget with linear and curved trend lines
ggplot(movies_1997, aes(x = budget, y = intgross)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_smooth(method = "lm", se = FALSE, color = "red")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

# Residual plot for bechdel_model
ggplot(bechdel_model, aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  1. From the scatterplot, we can see that there is one movie that is a massive outlier in both budget and earnings, and this outlier is pulling up the trend line that makes the model for “regular” movies that have budgets and earnings in “normal” ranges.

  2. The outlier movie is Titanic:

# One of many ways to filter to find the outlier movie!
movies_1997 %>% 
    filter(intgross > 2000000000)
# A tibble: 1 × 15
   year imdb      title   test  clean_test binary budget domgross intgross code 
  <int> <chr>     <chr>   <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
1  1997 tt0120338 Titanic ok    ok         PASS      2e8   6.59e8   2.19e9 1997…
# ℹ 5 more variables: budget_2013 <int>, domgross_2013 <dbl>,
#   intgross_2013 <dbl>, period_code <int>, decade_code <int>

Exercise 5: Is the model strong? Developing R-squared intuition

The R-squared metric is a way to quantify the strength of a model. It measures how much variation in the outcome/response variable can be explained by the model.

Where does R-squared come from? Well, it turns out that we can partition the variance of the observed response values into the variability that’s explained by the model (the variance of the predictions) and the variability that’s left unexplained by the model (the variance of the residuals):

\[\text{Var(observed) = Var(predicted) + Var(residuals)}\]

“Good” models have residuals that don’t deviate far from 0. So the smaller the variance in the residuals (thus larger the variance in the predictions), the stronger the model. Take a look at the picture below and write a few sentences addressing the following:

  • The first row corresponds to the weaker model. We can tell because the points are much more dispersed from the trend line than in the second row. Recall that the correlation metric measures how closely clustered points are about a straight line of best fit, so we would expect the correlation to be lower for the first row than the second row.
  • The variance of the residuals is much lower for the second row—the residuals are all quite small. This indicates a stronger model.

Exercise 6: R-squared Interpretations

summary(bike_mod1)

Call:
lm(formula = riders_registered ~ temp_feel, data = bikes)

Residuals:
    Min      1Q  Median      3Q     Max 
-3607.1  -959.2  -153.8   998.2  3304.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -667.916    251.608  -2.655  0.00811 ** 
temp_feel     57.892      3.306  17.514  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1310 on 729 degrees of freedom
Multiple R-squared:  0.2961,    Adjusted R-squared:  0.2952 
F-statistic: 306.7 on 1 and 729 DF,  p-value: < 2.2e-16

Multiple R-squared: 0.2961

Interpretation: 29.61% of the variation in number of registered riders on any given day can be explained by the variation in temperature (specifically, what temperature it “feels” like it is).

Exercise 7: Further exploring R-squared

In this exercise, we’ll look at data from a synthetic dataset called Anscombe’s quartet. Load the data in as follows, and look at the first few rows:

data(anscombe)

# Look at the first few rows
head(anscombe)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04

All of these models have close to the same intercept, slope, and R-squared!

anscombe_mod1 <- lm(y1 ~ x1, data = anscombe)
anscombe_mod2 <- lm(y2 ~ x2, data = anscombe)
anscombe_mod3 <- lm(y3 ~ x3, data = anscombe)
anscombe_mod4 <- lm(y4 ~ x4, data = anscombe)

summary(anscombe_mod1)

Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
summary(anscombe_mod2)

Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9009 -0.7609  0.1291  0.9491  1.2691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125   2.667  0.02576 * 
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6662,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
summary(anscombe_mod3)

Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1586 -0.6146 -0.2303  0.1540  3.2411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0025     1.1245   2.670  0.02562 * 
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
summary(anscombe_mod4)

Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
   Min     1Q Median     3Q    Max 
-1.751 -0.831  0.000  0.809  1.839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0017     1.1239   2.671  0.02559 * 
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6667,    Adjusted R-squared:  0.6297 
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

But when we look at the scatterplots, they all look substantially different, and we would want to approach our modeling differently for each one:

  • x1 and y1: A linear model seems appropriate for this data.
  • x2 and y2: The scatterplot is clearly curved—a “linear” regression model with squared terms, for example, would be more appropriate for this data. (We’ll talk more about ways to handle nonlinear relationships soon!)
  • x3 and y3: There is a very clear outlier at about x3 = 13 that we would want to dig into to better understand the context. After that investigation, we might consider removing this outlier and refitting the model.
  • x4 and y4: There is clearly something strange going on with most of the cases having an x4 value of exactly 8. We would not want to jump straight into modeling. Instead, we should dig deeper to find out more about this data.
ggplot(anscombe, aes(x = x1, y = y1)) +
    geom_point() + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

ggplot(anscombe, aes(x = x2, y = y2)) +
    geom_point() + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

ggplot(anscombe, aes(x = x3, y = y3)) +
    geom_point() + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

ggplot(anscombe, aes(x = x4, y = y4)) +
    geom_point() + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Exercises 8 - 11

No solutions for these exercises. These require longer discussions, not discrete answers.